AITopics | cross-attention layer

cross-attention layer

DAC-DETR: Divide the Attention Layers and Conquer

Neural Information Processing Systems | May-1-2026, 05:05:51 GMT

This paper reveals a characteristic of DEtection Transformer (DETR) that negatively impacts its training efficacy, i.e., the cross-attention and self-attention layers in the DETR decoder have opposing impacts on the object queries (though both impacts are important). Specifically, we observe the cross-attention tends to gather multiple queries around the same object, while the self-attention disperses these queries far away. To improve the training efficacy, we propose a Divide-And-Conquer DETR (DAC-DETR) that separates out the cross-attention to avoid these competing objectives. During training, DAC-DETR employs an auxiliary decoder that focuses on learning the cross-attention layers. The auxiliary decoder, while sharing all the other parameters, has NO self-attention layers and employs one-to-many label assignment to improve the gathering effect. Experiments show that DAC-DETR brings remarkable improvement over popular DETRs. For example, under the 12-epoch training scheme on MS-COCO, DAC-DETR improves Deformable DETR (ResNet-50) by +3.4 AP and achieves 50.9 AP (ResNet-50) / 58.1 AP (Swin-Large) based on some popular methods (i.e., DINO and an IoU-related loss).
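The split between the two decoders described in this abstract can be pictured with a small sketch. It is not the authors' implementation: the names `attend`, `residual`, and `decoder_layer` are hypothetical, and learned projections, feed-forward blocks, normalization, and label assignment are all omitted. It only shows one layer function used in two modes, with and without the self-attention step, over the same inputs.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys, values):
    # Single-head scaled dot-product attention on lists of vectors.
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ])
    return outputs

def residual(xs, ys):
    # Element-wise residual connection.
    return [[a + b for a, b in zip(x, y)] for x, y in zip(xs, ys)]

def decoder_layer(queries, memory, use_self_attention=True):
    # Main decoder: self-attention among queries, then cross-attention.
    # Auxiliary decoder (assumed form): the self-attention step is skipped.
    if use_self_attention:
        queries = residual(queries, attend(queries, queries, queries))
    return residual(queries, attend(queries, memory, memory))

# Toy inputs, chosen only for illustration.
object_queries = [[1.0, 0.0], [0.9, 0.1]]
image_features = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

main_out = decoder_layer(object_queries, image_features, use_self_attention=True)
aux_out = decoder_layer(object_queries, image_features, use_self_attention=False)
```

Because both calls go through the same `decoder_layer`, anything learned inside the cross-attention step would be shared between the two paths, which is the property the abstract attributes to the auxiliary decoder.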

artificial intelligence, machine learning, query, (16 more...)

Neural Information Processing Systems

Country: Europe > Switzerland (0.30)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Vision (0.72)

505259756244493872b7709a8a01b536-Supplemental.pdf

Neural Information Processing Systems | Apr-25-2026, 21:26:31 GMT

artificial intelligence, machine learning, top-5 pseudo-target, (15 more...)

Neural Information Processing Systems

Industry:

Leisure & Entertainment (0.70)
Media > Music (0.48)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

939f20cc178460749e4ab5fa28fd3b10-Paper-Conference.pdf

Neural Information Processing Systems | Feb-16-2026, 17:12:01 GMT

artificial intelligence, machine learning, natural language, (20 more...)

Neural Information Processing Systems

Country:

Asia > China > Shanghai > Shanghai (0.04)
Africa > Central African Republic > Ombella-M'Poko > Bimbo (0.04)

Genre: Research Report > Experimental Study (0.93)

Industry: Information Technology (0.67)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Vision > Face Recognition (0.95)
Information Technology > Artificial Intelligence > Natural Language (0.93)

505259756244493872b7709a8a01b536-Supplemental.pdf

Neural Information Processing Systems | Feb-8-2026, 15:45:56 GMT

human attention map, initial learning rate, pseudo, (13 more...)

Neural Information Processing Systems

Country: Asia > Middle East > Republic of Türkiye (0.05)

Industry:

Leisure & Entertainment (0.70)
Media > Music (0.48)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Energy-Based Cross Attention for Bayesian Context Update in Text-to-Image Diffusion Models

Neural Information Processing Systems | Dec-27-2025, 04:38:35 GMT

Despite the remarkable performance of text-to-image diffusion models in image generation tasks, recent studies have raised the issue that generated images sometimes cannot capture the intended semantic contents of the text prompts, a phenomenon often called semantic misalignment. To address this, here we present a novel energy-based model (EBM) framework for adaptive context control by modeling the posterior of context vectors. Specifically, we first formulate EBMs of latent image representations and text embeddings in each cross-attention layer of the denoising autoencoder. Then, we obtain the gradient of the log posterior of context vectors, which can be updated and transferred to the subsequent cross-attention layer, thereby implicitly minimizing a nested hierarchy of energy functions. Our latent EBMs further allow zero-shot compositional generation as a linear combination of cross-attention outputs from different contexts. Using extensive experiments, we demonstrate that the proposed method is highly effective in handling various image generation tasks, including multi-concept generation, text-guided image inpainting, and real and synthetic image editing.
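The phrase "a linear combination of cross-attention outputs from different contexts" can be illustrated with a toy sketch. This is an assumption-laden illustration, not the paper's energy-based formulation: `cross_attention` and `compose` are hypothetical names, learned projections are omitted, and the mixing weights are arbitrary values chosen for the example.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(latents, context):
    # Latent image tokens act as queries; context (text) vectors act as
    # both keys and values. No learned projections in this sketch.
    scale = 1.0 / math.sqrt(len(context[0]))
    outputs = []
    for q in latents:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in context]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[d] for w, v in zip(weights, context))
            for d in range(len(context[0]))
        ])
    return outputs

def compose(latents, contexts, weights):
    # Weighted sum of the cross-attention outputs computed per context.
    dim = len(contexts[0][0])
    combined = [[0.0] * dim for _ in latents]
    for context, weight in zip(contexts, weights):
        output = cross_attention(latents, context)
        for i, row in enumerate(output):
            for d, value in enumerate(row):
                combined[i][d] += weight * value
    return combined

# Toy latents and two toy "prompt" contexts, for illustration only.
latents = [[1.0, 0.0], [0.0, 1.0]]
context_a = [[1.0, 0.0], [0.5, 0.5]]
context_b = [[0.0, 1.0], [0.2, 0.8]]

blended = compose(latents, [context_a, context_b], [0.5, 0.5])
```

Setting the weights to `[1.0, 0.0]` recovers the output for `context_a` alone, so the single-context case is a special case of the combination.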

bayesian context update, energy-based cross attention, text-to-image diffusion model, (6 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.83)

Attention Guided Alignment in Efficient Vision-Language Models

Mahajan, Shweta, Le, Hoang, Park, Hyojin, Farhadzadeh, Farzad, Hayat, Munawar, Porikli, Fatih

arXiv.org Artificial Intelligence | Nov-25-2025

Large Vision-Language Models (VLMs) rely on effective multimodal alignment between pre-trained vision encoders and Large Language Models (LLMs) to integrate visual and textual information. This paper presents a comprehensive analysis of attention patterns in efficient VLMs, revealing that concatenation-based architectures frequently fail to distinguish between semantically matching and non-matching image-text pairs. This is a key factor for object hallucination in these models. To address this, we introduce Attention-Guided Efficient Vision-Language Models (AGE-VLM), a novel framework that enhances visual grounding through interleaved cross-attention layers to instill vision capabilities in pretrained small language models. This enforces in the VLM the ability to "look" at the correct image regions by leveraging spatial knowledge distilled from the Segment Anything Model (SAM), significantly reducing hallucination. We validate our approach across different vision-centric benchmarks where our method is better than or comparable to prior work on efficient VLMs. Our findings provide valuable insights for future research aimed at achieving enhanced visual and linguistic understanding in VLMs.

artificial intelligence, large language model, natural language, (16 more...)

arXiv.org Artificial Intelligence

2511.17793

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.91)

Energy-Based Cross Attention for Bayesian Context Update in Text-to-Image Diffusion Models

Park, Geon Yeong, Kim, Jeongsol, Kim, Beomsu, Lee, Sang Wan

Neural Information Processing Systems | Nov-20-2025, 00:33:05 GMT

Since diffusion models require iterative sampling in a high-dimensional space, they are computationally expensive and time-consuming.

artificial intelligence, bayesian inference, machine learning, (15 more...)

Neural Information Processing Systems

Country:

Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
Asia > South Korea > Daejeon > Daejeon (0.04)
Asia > Middle East > Israel (0.04)

Genre: Research Report > New Finding (0.46)

Industry: Health & Medicine (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models (0.68)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.46)

Attentive Feature Aggregation or: How Policies Learn to Stop Worrying about Robustness and Attend to Task-Relevant Visual Cues

Tsagkas, Nikolaos, Sochopoulos, Andreas, Danier, Duolikun, Vijayakumar, Sethu, Kouris, Alexandros, Mac Aodha, Oisin, Lu, Chris Xiaoxuan

arXiv.org Artificial Intelligence | Nov-17-2025

The adoption of pre-trained visual representations (PVRs), leveraging features from large-scale vision models, has become a popular paradigm for training visuomotor policies. However, these powerful representations can encode a broad range of task-irrelevant scene information, making the resulting trained policies vulnerable to out-of-domain visual changes and distractors. In this work we address visuomotor policy feature pooling as a solution to the observed lack of robustness in perturbed scenes. We achieve this via Attentive Feature Aggregation (AFA), a lightweight, trainable pooling mechanism that learns to naturally attend to task-relevant visual cues, ignoring even semantically rich scene distractors. Through extensive experiments in both simulation and the real world, we demonstrate that policies trained with AFA significantly outperform standard pooling approaches in the presence of visual perturbations, without requiring expensive dataset augmentation or fine-tuning of the PVR. Our findings show that ignoring extraneous visual information is a crucial step towards deploying robust and generalisable visuomotor policies. Project Page: tsagkas.github.io/afa
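As a rough illustration of attention-based pooling in contrast to standard pooling, the sketch below compares a single-query attention pool with a plain mean over patch tokens. The abstract does not specify AFA's architecture, so this is a generic stand-in: `attentive_pool` and `mean_pool` are hypothetical names, and the query, which would be a trainable parameter in practice, is fixed by hand here.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attentive_pool(query, tokens):
    # One query vector scores every patch token; the pooled feature is the
    # attention-weighted sum of the tokens.
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(a * b for a, b in zip(query, t)) for t in tokens]
    weights = softmax(scores)
    pooled = [
        sum(w * t[d] for w, t in zip(weights, tokens))
        for d in range(len(tokens[0]))
    ]
    return pooled, weights

def mean_pool(tokens):
    # Baseline: every token contributes equally, distractors included.
    count = len(tokens)
    return [sum(t[d] for t in tokens) / count for d in range(len(tokens[0]))]

# Toy features: one "task-relevant" token and two "distractor" tokens.
patch_tokens = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
query = [10.0, 0.0]  # hand-set stand-in for a learned query

pooled, weights = attentive_pool(query, patch_tokens)
averaged = mean_pool(patch_tokens)
```

With this hand-set query nearly all attention weight falls on the first token, while the mean-pooled feature is dominated by the two distractor tokens.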

artificial intelligence, information, machine learning, (15 more...)

arXiv.org Artificial Intelligence

2511.10762

Country: Europe (0.28)

Genre: Research Report > New Finding (0.68)

Industry: Health & Medicine > Therapeutic Area > Psychiatry/Psychology > Mental Health (0.40)

Technology: